knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lubridate)
library(viridis)
library(leaflet)
lat_long = read.csv("./data/lat-long.csv")
devtools::install_github("ijlyttle/vembedr")
library("vembedr")
\(~\)
The motivation of this project was to use the data, the New York City (NYC) Department of Health and Mental Hygience (DOHMH) Dog Licensing Dataset. This dataset spans the five boroughs of NYC and will help to deepen our understanding of the current patterns and trends among canines in NYC.
\(~\)
We sought to understand the current landscape of licensed dogs in New York City. Additionally, we were also interested in comparing these demographic factors among the five boroughs. Furthermore, we wanted to dive into the current patterns of licensing between 2017 - 2019 and examine if there were variations by year or month of license. Finally, we sought to find ways of predicting a dog’s age based on the borough and the gender of the dog. Based on this information, our questions are:
\(~\)
We utilized the NYC Dog Licensing dataset. All dog owners who reside in NYC are required by law to license their dogs. These licenses can range from one to three years; owners are able to choose when they purchase one. This is a publicly available dataset sourced from the DOHMH Dog Licensing System where owners apply for and renew dog licenses. This dataset was created on September 19, 2017 and was last updated on July 25, 2019. There are 10 variables in this dataset with a raw number of 345,727 observations. The variables provided in the raw dataset are as follows:
\(~\)
\(~\)
lat_lon = read.csv("./data/lat-long.csv")
dogz = read.csv("./data/NYC_Dog_Licensing_Dataset.csv") %>%
janitor::clean_names() %>%
filter(breed_name != "Unknown",
extract_year-animal_birth_month < 20) %>%
mutate(animal_name = str_to_sentence(animal_name),
borough = ifelse(zip_code %in% c(10001:10282),"Manhattan", borough),
borough = ifelse(zip_code %in% c(10301:10314), "Staten Island", borough),
borough = ifelse(zip_code %in% c(10451:10475), "Bronx", borough),
borough = ifelse(zip_code %in% c(11004,11101:11106,11109,11351,11354:11375,11377:11379,11385,11411:11423,11426:11430,11432:11436,11691:11694,11697), "Queens", borough),
borough = ifelse(zip_code %in% c(11201,11203:11226,11228:11239,11241:11243,11249,11252,11256), "Brooklyn", borough),
borough = as.factor(borough),
breed_name = as.character(breed_name),
license_issued_date = as.Date(license_issued_date, format = "%m/%d/%Y"),
license_expired_date = as.Date(license_expired_date, format = "%m/%d/%Y"),
license_length = year(license_expired_date) - year(license_issued_date),
dog_age = 2019 - animal_birth_month,
breed_name = ifelse(breed_name == "American Pit Bull Terrier/Pit Bull", "American Pit Bull Mix / Pit Bull Mix", breed_name),
breed_name = ifelse(breed_name == "American Pit Bull Terrier/Pit Bull", "American Pit Bull Mix / Pit Bull Mix", breed_name),
breed_name = ifelse(breed_name == "Australian Cattle Dog", "Australian Cattledog", breed_name),
breed_name = ifelse(breed_name == "Bassett Hound", "Basset Hound", breed_name),
breed_name = ifelse(breed_name == "Brittany", "Brittany Spaniel", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Black and Tan", "Black and Tan Coonhound", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Black and Tan", "Black and Tan Coonhound", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Blue Tick", "Bluetick Coonhound", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Treeing Walker", "Treeing Walker Coonhound", breed_name),
breed_name = ifelse(breed_name == "Cotton de Tulear", "Coton de Tulear", breed_name),
breed_name = ifelse(breed_name == "Japanese Chin/Spaniel", "Japanese Chin", breed_name),
breed_name = ifelse(breed_name == "Jindo Dog, Korea", "Jindo", breed_name),
breed_name = ifelse(breed_name == "Mastiff, French (Dogue de Bordeaux)", "Dogue de Bordeaux", breed_name),
breed_name = ifelse(breed_name == "Pharoh hound", "Pharaoh Hound", breed_name),
breed_name = ifelse(breed_name == "Schipperke", "Schipperkee", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Miniature", "Miniature Schnauzer", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Miniature Crossbreed", "Miniature Schnauzer, Crossbreed", breed_name),
breed_name = ifelse(breed_name == "Welsh Corgi, Cardigan", "Cardigan Welsh Corgi", breed_name),
breed_name = ifelse(breed_name == "Welsh Corgi, Pembroke", "Pembroke Welsh Corgi", breed_name),
breed_name = ifelse(breed_name == "Bull Dog, French", "French Bulldog", breed_name),
breed_name = ifelse(breed_name == "Collie, Bearded", "Bearded Collie", breed_name),
breed_name = ifelse(breed_name == "Bullmastiff", "Mastiff, Bull", breed_name),
breed_name = ifelse(breed_name == "Neapolitan Mastiff", "Mastiff, Neapolitan", breed_name),
breed_name = ifelse(breed_name == "Tibetan Mastiff", "Mastiff, Tibetan", breed_name),
breed_name = ifelse(breed_name == "Pointer, German Shorthaired", "German Shorthaired Pointer", breed_name),
breed_name = ifelse(breed_name == "Pointer, German Wirehaired", "German Wirehaired Pointer", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Giant", "Giant Schnauzer", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Standard", "Standard Schnauzer", breed_name),
breed_name = ifelse(breed_name == "Chinese Shar-Pei", "Shar-Pei, Chinese", breed_name)) %>%
drop_na(borough) %>%
rename(animal_birth_year = animal_birth_month, license_id = row_number) %>%
select(-extract_year)
dogz_1 =
dogz %>%
distinct(animal_name, animal_gender, animal_birth_year, breed_name, .keep_all = TRUE) %>%
select (-license_issued_date, -license_expired_date, -license_length)
distinct_dogz =
left_join(dogz_1, lat_lon, by = "zip_code") %>%
select(-City,-State,-Timezone,-Daylight.savings.time.flag,-geopoint) %>%
janitor::clean_names()
The dataset was cleaned and adjusted for this analysis. All dogs whose breed were listed as “unknown” were removed, as well as any dogs who were listed as over 20 years old at the time of their license. Originally there were no boroughs listed in this dataset, so we used the available zip codes to assign boroughs to each record. Records whose zip code did not correlate to a borough were dropped. A variable license_length was added to calculate the length in years of each license. There were slight differences in the way breed names were recorded, these were adjusted to be uniform.
Variables were changed to the appropriate data type where necessary.
The main database included the same dog multiple times if it had more than one license. We created a subset to catalog only the distinct dogs present in our dataset. To identify individual dogs we matched records on animal name, gender, birth year, and breed name. We also added latitude and longitude data that corresponds to zip code for mapping purposes.
\(~\)
plot_2=
distinct_dogz %>%
filter(animal_gender == "M" | animal_gender == "F") %>%
ggplot(aes(x = borough, fill = animal_gender)) +
geom_histogram(position = "dodge", stat = "count") +
labs(
title = "Gender of dogs by borough",
x = "New York Borough",
y = "Number of dogs"
)
plot_2

Based on this bar graph plotting borough by frequency of dogs, we can see that overall, Manhattan has the most dogs (over 20,000 dogs for each gender) followed by Brooklyn, Queens, The Bronx, and Staten Island (with about 5,000 dogs for each gender). Additionally, among all boroughs there are more male dogs than there are female dogs.
\(~\)
plot_3=
distinct_dogz %>%
filter(dog_age < 20) %>%
ggplot(aes(x = borough, y = dog_age)) +
geom_boxplot (color = "purple") +
labs(
title = "Age distribution of dogs by borough",
x = "New York Borough",
y = "Dog's Age"
)
plot_3

Based on this box and whisker plot, which looks at dog’s age in years per borough, we can see that overall, the median age for dogs is about 6 years. The Bronx, Brooklyn, and Queens have similar distributions which are right skewed while Manhattan has a smaller range of values. Staten Island has the oldest dogs ranging above 10 years with no outliers shown.
\(~\)
table2 = distinct_dogz %>%
group_by(dog_age) %>%
count(breed_name) %>%
top_n(3)
knitr::kable(table2)
| dog_age | breed_name | n |
|---|---|---|
| 1 | Labrador Retriever Crossbreed | 231 |
| 1 | Shih Tzu | 400 |
| 1 | Yorkshire Terrier | 569 |
| 2 | American Pit Bull Mix / Pit Bull Mix | 448 |
| 2 | Shih Tzu | 564 |
| 2 | Yorkshire Terrier | 813 |
| 3 | American Pit Bull Mix / Pit Bull Mix | 697 |
| 3 | Shih Tzu | 579 |
| 3 | Yorkshire Terrier | 754 |
| 4 | American Pit Bull Mix / Pit Bull Mix | 896 |
| 4 | Shih Tzu | 472 |
| 4 | Yorkshire Terrier | 548 |
| 5 | American Pit Bull Mix / Pit Bull Mix | 956 |
| 5 | Chihuahua | 529 |
| 5 | Shih Tzu | 507 |
| 6 | American Pit Bull Mix / Pit Bull Mix | 778 |
| 6 | Chihuahua | 542 |
| 6 | Yorkshire Terrier | 481 |
| 7 | American Pit Bull Mix / Pit Bull Mix | 619 |
| 7 | Chihuahua | 445 |
| 7 | Shih Tzu | 471 |
| 8 | American Pit Bull Mix / Pit Bull Mix | 526 |
| 8 | Shih Tzu | 404 |
| 8 | Yorkshire Terrier | 452 |
| 9 | American Pit Bull Mix / Pit Bull Mix | 439 |
| 9 | Shih Tzu | 431 |
| 9 | Yorkshire Terrier | 440 |
| 10 | Chihuahua | 383 |
| 10 | Shih Tzu | 393 |
| 10 | Yorkshire Terrier | 406 |
| 11 | Chihuahua | 368 |
| 11 | Shih Tzu | 381 |
| 11 | Yorkshire Terrier | 390 |
| 12 | Chihuahua | 333 |
| 12 | Shih Tzu | 364 |
| 12 | Yorkshire Terrier | 391 |
| 13 | Chihuahua | 260 |
| 13 | Shih Tzu | 310 |
| 13 | Yorkshire Terrier | 290 |
| 14 | Chihuahua | 238 |
| 14 | Shih Tzu | 270 |
| 14 | Yorkshire Terrier | 254 |
| 15 | Chihuahua | 230 |
| 15 | Shih Tzu | 188 |
| 15 | Yorkshire Terrier | 207 |
| 16 | Chihuahua | 147 |
| 16 | Shih Tzu | 160 |
| 16 | Yorkshire Terrier | 159 |
| 17 | Chihuahua | 105 |
| 17 | Shih Tzu | 141 |
| 17 | Yorkshire Terrier | 106 |
| 18 | Chihuahua | 80 |
| 18 | Shih Tzu | 90 |
| 18 | Yorkshire Terrier | 71 |
| 19 | Chihuahua | 52 |
| 19 | Shih Tzu | 76 |
| 19 | Yorkshire Terrier | 51 |
| 20 | Chihuahua | 25 |
| 20 | Jack Russell Terrier | 23 |
| 20 | Shih Tzu | 23 |
| 21 | Jack Russell Terrier | 10 |
| 21 | Maltese | 9 |
| 21 | Shih Tzu | 12 |
| 22 | Bichon Frise | 7 |
| 22 | Chihuahua | 4 |
| 22 | Maltese | 4 |
| 22 | Poodle, Toy | 5 |
In this table, we see the top 3 dog breeds for each dog aged 1 - 22 in the dataset. We can see that there are more larger dog breeds appearing as the top breed in some of the younger age categories. After 9 years, all of the top 3 dog breeds for each age consist of small dog breeds. This is consistent with literature findings that smaller dog breeds tend to live longer lives than larger dog breeds. In addition, we find that several dog breeds outshine others in being the top dog breed for many of the age categories including Chihuahua, Yorkshire Terrier, and American Pit Bull Mix / Pit Bull Mix.
\(~\)
plot_4=
distinct_dogz %>%
group_by(borough) %>%
count(breed_name) %>%
top_n(5) %>%
ggplot(aes(x = breed_name, y = n)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_grid(~borough, scales = "free", space = "free") +
labs(
title = "Top 5 breeds of dogs by borough",
x = "Dog Breed",
y = "Total Count"
)
plot_4

Based on this plot, plotting dog breed by the total count, the most common dog breeds are American Pit Bull Mix/Pit Bull Mix, Chihuahua, Maltese, Shih Tzu, and Yorkshire Terrier. American Pit Bull Mix/Pit Bull Mix has the highest observed frequency in Brooklyn (n =2000) while the lowest in Staten Island (n = 600). Additionally, in Manhattan, Chihuahua’s, and Yorkshire Terriers have the highest observed frequency.
\(~\)
plot_5.5=
dogz %>%
mutate(license_issued_date = floor_date(as_date(license_issued_date), "month")) %>%
group_by(license_issued_date) %>%
summarize(n_obs = n()) %>%
ggplot(aes(x = license_issued_date, y = n_obs, color = n_obs)) +
geom_point() + geom_line() +
labs(title = "Number of licenses over time",
x = "License issue date",
y = "Number of licenses")
plot_5.5

Based on this line graph, license issue date by the number of licenses, the highest frequency of licences were issued between 2016 - 2018 while the fewest number of licenses were issued from 2015-2016. This could be due to the fact that because the dataset was capturing the dog licensed population from 2017-2019, a license that was purchased before those dates still could be active in the study time frame.
\(~\)
plot_5.51=
dogz %>%
mutate(license_issued_date = month(license_issued_date),
license_issued_date = month.abb[license_issued_date],
license_issued_date = factor(license_issued_date, c("Jan","Feb", "Mar", "Apr", "May", "Jun", "Jul","Aug", "Sep", "Oct","Nov", "Dec"), ordered = TRUE)) %>%
group_by(license_issued_date) %>%
summarize(n_obs = n()) %>%
ggplot(aes(x = license_issued_date, y = n_obs, color = n_obs)) +
geom_point() + geom_line(group=1) +
labs(title = "Number of licenses issued per month",
x = "License issue date",
y = "Number of licenses")
plot_5.51

Based on this line graph, there is an increasing trend present in regards to the number of licenses issued per month. From January and March there is a steep increase, while there is a more gradual increase between March and June. There is a spike from June to July and a plateau from July to August. From August, there is a steady decrease until November, which then changes to an increase in December. There many plausible explanations for this phenomenon, but we hypothesize that it could be due to the fact that in the warmer months, individuals are more likely to want to adopt/buy a dog where it is easier to upkeep a dog as opposed to the colder winter months of the Northeast.
\(~\)
top_dogz =
distinct_dogz %>%
filter(animal_name != "Unknown",
animal_name != "Name not provided") %>%
group_by(animal_name) %>%
summarize(n_obs = n()) %>%
arrange(desc(n_obs)) %>%
top_n(10)
knitr::kable(top_dogz)
| animal_name | n_obs |
|---|---|
| Max | 720 |
| Charlie | 696 |
| Bella | 667 |
| Coco | 633 |
| Lucy | 568 |
| Lola | 539 |
| Rocky | 535 |
| Bailey | 529 |
| Buddy | 503 |
| Lucky | 497 |
In this table, we see the 10 most popular dog names among all of the data in this dataset. Max ranks in first place with 720 counts among the entire dataset. It is interesting to note that several of the dog names could be gender neutral among this top 10 list.
\(~\)
plot_6 =
distinct_dogz %>%
filter(animal_name != "Unknown",
animal_name != "Name not provided",
animal_name != "Name",
animal_birth_year != 1997) %>%
group_by(animal_birth_year, animal_name) %>%
summarize(n_obs = n()) %>%
top_n(1) %>%
ggplot(aes(x=animal_birth_year, y=n_obs, group = animal_name, color = animal_name)) + geom_point() + geom_line() +
labs(title = "Most popular dog name by birth year",
x = "Birth year",
y = "Number of dogs")
plot_6

Based on this line graph, we can see the trend of dog names over time (2000-2015). Names like Coco, Max, and Bella had an increase between 2000 - 2010 while names like Charlie gained popularity starting around 2007 until about 2016. Additionally, while names like Lucky and Lucy were popular, they only spanned a short duration of about a year. Interestingly, there has been a decrease in the popularity of the name Luna since 2016.
\(~\)
pal = colorNumeric(
palette = "viridis",
domain = distinct_dogz$zip_code)
distinct_dogz %>%
group_by(breed_name) %>%
mutate(click_label = str_c("<b> year(s)", dog_age, "</b><br>", animal_birth_year, "born <br>", animal_gender, " gender")) %>%
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(~longitude, ~latitude, radius = .1, color = ~pal(zip_code), popup = ~click_label)